The SpamBase dataset is a classification dataset of 4601 emails collected at HP (Hewlett-Packard) over a period of time. It contains 57 numeric features for each email and a binary label: 1 for spam, 0 for ham (legitimate email). This is a typical binary classification problem with an added need for careful feature selection, as many of the features provided in the dataset may be uninformative (Hopkins et al., 1998).
In this section, we use several Python libraries (pandas, NumPy, matplotlib, and seaborn) to summarise and visualise the SpamBase dataset. This gives us a better understanding of the distributions of the features and their relationships with the target variable (label).
import numpy as np
import pandas as pd
#for data visualisation:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# read the file containing the column names
with open('spambase.names') as f:
    list_contents = f.readlines()
colnames = []
for item in list_contents:
    colname = item.split(':')[0]
    colnames.append(colname)
colnames.append('label')
# get the length of the features(columns)
len(colnames)
58
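Note: the raw spambase.names file from the UCI repository begins with comment lines (starting with '|') and a class line, which the simple split above would sweep into colnames; the output of 58 suggests the file used here was already cleaned. A more defensive parse, sketched under that assumption about the file layout, could look like:

```python
def parse_names(lines):
    """Extract attribute names from spambase.names-style lines.

    Skips blank lines and '|' comment lines, and keeps only lines of the
    'name: type.' form that UCI .names files use for attributes.
    """
    names = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('|'):
            continue  # skip comments and blanks
        if ':' in line:
            names.append(line.split(':')[0].strip())
    return names

sample = [
    "| SPAM E-MAIL DATABASE ATTRIBUTES\n",
    "\n",
    "word_freq_make: continuous.\n",
    "char_freq_;: continuous.\n",
]
print(parse_names(sample))  # -> ['word_freq_make', 'char_freq_;']
```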
# read the file containing the dataset and assign it to a variable
dataset = pd.read_csv('spambase.data', header=None)
dataset.columns = colnames
# get the first five rows of the dataset
dataset.head()
| word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00 | 0.64 | 0.64 | 0.0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.000 | 0.0 | 0.778 | 0.000 | 0.000 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.50 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.00 | 0.94 | ... | 0.00 | 0.132 | 0.0 | 0.372 | 0.180 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.00 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.010 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.00 | 0.137 | 0.0 | 0.137 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
| 4 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.00 | 0.135 | 0.0 | 0.135 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
5 rows × 58 columns
# get the last five rows of the dataset
dataset.tail()
| word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4596 | 0.31 | 0.0 | 0.62 | 0.0 | 0.00 | 0.31 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000 | 0.232 | 0.0 | 0.000 | 0.0 | 0.0 | 1.142 | 3 | 88 | 0 |
| 4597 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000 | 0.000 | 0.0 | 0.353 | 0.0 | 0.0 | 1.555 | 4 | 14 | 0 |
| 4598 | 0.30 | 0.0 | 0.30 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.102 | 0.718 | 0.0 | 0.000 | 0.0 | 0.0 | 1.404 | 6 | 118 | 0 |
| 4599 | 0.96 | 0.0 | 0.00 | 0.0 | 0.32 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000 | 0.057 | 0.0 | 0.000 | 0.0 | 0.0 | 1.147 | 5 | 78 | 0 |
| 4600 | 0.00 | 0.0 | 0.65 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000 | 0.000 | 0.0 | 0.125 | 0.0 | 0.0 | 1.250 | 5 | 40 | 0 |
5 rows × 58 columns
# check datatype for all columns and rows
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   word_freq_make              4601 non-null   float64
 1   word_freq_address           4601 non-null   float64
 2   word_freq_all               4601 non-null   float64
 3   word_freq_3d                4601 non-null   float64
 4   word_freq_our               4601 non-null   float64
 5   word_freq_over              4601 non-null   float64
 6   word_freq_remove            4601 non-null   float64
 7   word_freq_internet          4601 non-null   float64
 8   word_freq_order             4601 non-null   float64
 9   word_freq_mail              4601 non-null   float64
 10  word_freq_receive           4601 non-null   float64
 11  word_freq_will              4601 non-null   float64
 12  word_freq_people            4601 non-null   float64
 13  word_freq_report            4601 non-null   float64
 14  word_freq_addresses         4601 non-null   float64
 15  word_freq_free              4601 non-null   float64
 16  word_freq_business          4601 non-null   float64
 17  word_freq_email             4601 non-null   float64
 18  word_freq_you               4601 non-null   float64
 19  word_freq_credit            4601 non-null   float64
 20  word_freq_your              4601 non-null   float64
 21  word_freq_font              4601 non-null   float64
 22  word_freq_000               4601 non-null   float64
 23  word_freq_money             4601 non-null   float64
 24  word_freq_hp                4601 non-null   float64
 25  word_freq_hpl               4601 non-null   float64
 26  word_freq_george            4601 non-null   float64
 27  word_freq_650               4601 non-null   float64
 28  word_freq_lab               4601 non-null   float64
 29  word_freq_labs              4601 non-null   float64
 30  word_freq_telnet            4601 non-null   float64
 31  word_freq_857               4601 non-null   float64
 32  word_freq_data              4601 non-null   float64
 33  word_freq_415               4601 non-null   float64
 34  word_freq_85                4601 non-null   float64
 35  word_freq_technology        4601 non-null   float64
 36  word_freq_1999              4601 non-null   float64
 37  word_freq_parts             4601 non-null   float64
 38  word_freq_pm                4601 non-null   float64
 39  word_freq_direct            4601 non-null   float64
 40  word_freq_cs                4601 non-null   float64
 41  word_freq_meeting           4601 non-null   float64
 42  word_freq_original          4601 non-null   float64
 43  word_freq_project           4601 non-null   float64
 44  word_freq_re                4601 non-null   float64
 45  word_freq_edu               4601 non-null   float64
 46  word_freq_table             4601 non-null   float64
 47  word_freq_conference        4601 non-null   float64
 48  char_freq_;                 4601 non-null   float64
 49  char_freq_(                 4601 non-null   float64
 50  char_freq_[                 4601 non-null   float64
 51  char_freq_!                 4601 non-null   float64
 52  char_freq_$                 4601 non-null   float64
 53  char_freq_#                 4601 non-null   float64
 54  capital_run_length_average  4601 non-null   float64
 55  capital_run_length_longest  4601 non-null   int64
 56  capital_run_length_total    4601 non-null   int64
 57  label                       4601 non-null   int64
dtypes: float64(55), int64(3)
memory usage: 2.0 MB
# this gives a statistical summary for each column
dataset.describe()
| word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | ... | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 |
| mean | 0.104553 | 0.213015 | 0.280656 | 0.065425 | 0.312223 | 0.095901 | 0.114208 | 0.105295 | 0.090067 | 0.239413 | ... | 0.038575 | 0.139030 | 0.016976 | 0.269071 | 0.075811 | 0.044238 | 5.191515 | 52.172789 | 283.289285 | 0.394045 |
| std | 0.305358 | 1.290575 | 0.504143 | 1.395151 | 0.672513 | 0.273824 | 0.391441 | 0.401071 | 0.278616 | 0.644755 | ... | 0.243471 | 0.270355 | 0.109394 | 0.815672 | 0.245882 | 0.429342 | 31.729449 | 194.891310 | 606.347851 | 0.488698 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.588000 | 6.000000 | 35.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.065000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.276000 | 15.000000 | 95.000000 | 0.000000 |
| 75% | 0.000000 | 0.000000 | 0.420000 | 0.000000 | 0.380000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.160000 | ... | 0.000000 | 0.188000 | 0.000000 | 0.315000 | 0.052000 | 0.000000 | 3.706000 | 43.000000 | 266.000000 | 1.000000 |
| max | 4.540000 | 14.280000 | 5.100000 | 42.810000 | 10.000000 | 5.880000 | 7.270000 | 11.110000 | 5.260000 | 18.180000 | ... | 4.385000 | 9.752000 | 4.081000 | 32.478000 | 6.003000 | 19.829000 | 1102.500000 | 9989.000000 | 15841.000000 | 1.000000 |
8 rows × 58 columns
# get the shape of the dataset: the first element is the number of rows, the second the number of columns
np.shape(dataset)
(4601, 58)
#check whether there are any missing or null values
dataset.isnull().sum()  # this gives the number of NaN values in every column
word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_freq_857                 0
word_freq_data                0
word_freq_415                 0
word_freq_85                  0
word_freq_technology          0
word_freq_1999                0
word_freq_parts               0
word_freq_pm                  0
word_freq_direct              0
word_freq_cs                  0
word_freq_meeting             0
word_freq_original            0
word_freq_project             0
word_freq_re                  0
word_freq_edu                 0
word_freq_table               0
word_freq_conference          0
char_freq_;                   0
char_freq_(                   0
char_freq_[                   0
char_freq_!                   0
char_freq_$                   0
char_freq_#                   0
capital_run_length_average    0
capital_run_length_longest    0
capital_run_length_total      0
label                         0
dtype: int64
# if needed, check NaN values per row instead with axis=1 (the default, axis=0, counts per column)
dataset.isnull().sum(axis=1)
0 0
1 0
2 0
3 0
4 0
..
4596 0
4597 0
4598 0
4599 0
4600 0
Length: 4601, dtype: int64
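For a single yes/no answer instead of the long per-column listing, the whole frame can be checked at once; a small sketch on a stand-in frame (the column values here are made up):

```python
import pandas as pd

# a tiny frame standing in for the spambase data (values are hypothetical)
df = pd.DataFrame({'word_freq_make': [0.0, 0.21, 0.06],
                   'label': [1, 1, 0]})

has_missing = df.isnull().values.any()        # True if any cell is NaN
total_missing = int(df.isnull().sum().sum())  # total NaN count over all cells
print(has_missing, total_missing)  # -> False 0
```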
# check whether any rows in the dataset are duplicates (duplicated() compares full rows)
dataset.duplicated()
0 False
1 False
2 False
3 False
4 False
...
4596 False
4597 False
4598 False
4599 False
4600 False
Length: 4601, dtype: bool
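The row-by-row printout of duplicated() is hard to scan for 4601 entries; summing the boolean Series gives a single count, and drop_duplicates() removes the repeats. A sketch on a toy frame:

```python
import pandas as pd

# toy frame: row 2 is an exact copy of row 0
df = pd.DataFrame({'a': [1, 2, 1, 3], 'b': [0, 0, 0, 1]})

n_dups = int(df.duplicated().sum())  # number of repeated rows (first kept)
deduped = df.drop_duplicates()       # frame with the repeats removed
print(n_dups, len(deduped))  # -> 1 3
```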
# plot a histogram of the dataset.
hist_of_dataset = dataset.hist(figsize = (30,20))
hist_of_dataset
(output: an 8 x 8 grid of AxesSubplot histograms, one per column, titled 'word_freq_make' through 'label')
# visualise missing values with a bar chart: each bar shows the percentage of missing values in a column.
# in this dataset there are no missing values, so every bar sits at zero.
sns.set(rc={'figure.figsize':(17,7)})
miss_vals = pd.DataFrame(dataset.isnull().sum() / len(dataset) * 100)
miss_vals.plot(kind='bar',title='Missing values in percentage',ylabel='percentage')
<AxesSubplot: title={'center': 'Missing values in percentage'}, ylabel='percentage'>
We've analysed the SpamBase dataset with exploratory tools such as graphs and statistical summaries, which equips us to do informed data preprocessing.
In this section, we clean the dataset by filling any missing values with the mean of their column, and we split the dataset into training and test sets. Feature scaling is needed for gradient-descent-based algorithms but has no significant effect on tree-based algorithms, so we apply it only in the neural network section. This is also the section where we would encode the target variable if it were not already in binary format.
# fill any missing data with the mean of its column
dataset.fillna(dataset.mean(), inplace=True)
from sklearn.model_selection import train_test_split
# separate the features from the target variable
X = dataset.drop('label', axis=1)
y = dataset['label']
# split the dataset into a 75% training set and a 25% test set. We also set the stratify parameter so both splits keep the original class proportions.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
# get the shape of the split data
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((3450, 57), (1151, 57), (3450,), (1151,))
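To confirm that stratify=y preserved the class balance across the splits, compare the spam fraction in each part; a sketch on toy labels with a 40% positive rate (the numbers are illustrative, chosen to mimic SpamBase's roughly 39% spam rate):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 40 positives out of 100 toy samples
y_toy = pd.Series([1] * 40 + [0] * 60)
X_toy = pd.DataFrame({'feature': range(100)})

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, stratify=y_toy,
                                      test_size=0.25, random_state=0)
# stratification keeps the 0.40 positive rate in both splits
print(ytr.mean(), yte.mean())  # -> 0.4 0.4
```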
We've filled any missing data and split our dataset into training and test sets. Scaling will be applied only to the neural network model.
Here we apply feature engineering and feature selection to keep only the relevant features. We achieve this with scikit-learn's VarianceThreshold, the Pearson correlation coefficient, and a decision tree's feature_importances_.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import VarianceThreshold
Using scikit-learn Variance threshold to remove features with variance 0
# check for variance within each feature and remove the features with variance=0
var_thres = VarianceThreshold(threshold=0.0) # set the threshold to 0
var_thres.fit(X)
VarianceThreshold()
# count the features whose variance is above the threshold
sum(var_thres.get_support())
57
# we check how many features have a variance of 0
constant_columns = [column for column in X_train.columns
                    if column not in X_train.columns[var_thres.get_support()]]
print(len(constant_columns))
0
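VarianceThreshold with threshold=0.0 simply drops columns whose values never change. The criterion can be written by hand, which makes it explicit; the 'all_zero' column below is hypothetical (as the output above shows, SpamBase itself has no constant feature):

```python
def is_constant(column):
    """A column has zero variance exactly when all its values are equal."""
    return len(set(column)) <= 1

features = {
    'word_freq_make': [0.0, 0.21, 0.06],  # varies -> kept
    'all_zero':       [0.0, 0.0, 0.0],    # constant -> would be dropped
}
constant = [name for name, col in features.items() if is_constant(col)]
print(constant)  # -> ['all_zero']
```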
# drop features with zero variance; we apply this only to X_train, not the whole dataset.
# note: drop() returns a new frame, and constant_columns is empty here, so nothing is actually removed.
X_train.drop(constant_columns, axis=1)
| word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | word_freq_conference | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4140 | 0.00 | 0.00 | 1.58 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.230 | 4 | 16 |
| 918 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.0 | 0.218 | 0.087 | 0.000 | 0.174 | 0.174 | 0.437 | 9.186 | 126 | 937 |
| 1250 | 0.00 | 0.00 | 0.84 | 0.0 | 0.84 | 0.0 | 0.84 | 0.00 | 0.00 | 0.00 | ... | 0.0 | 0.000 | 0.388 | 0.000 | 0.776 | 0.129 | 0.000 | 10.375 | 168 | 249 |
| 845 | 0.59 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 1.18 | 0.59 | 0.59 | 1.18 | ... | 0.0 | 0.000 | 0.000 | 0.000 | 0.421 | 0.000 | 0.000 | 6.275 | 46 | 182 |
| 199 | 0.51 | 0.51 | 0.00 | 0.0 | 0.00 | 0.0 | 0.51 | 0.00 | 0.00 | 0.51 | ... | 0.0 | 0.000 | 0.135 | 0.000 | 0.067 | 0.000 | 0.000 | 2.676 | 17 | 91 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3581 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.384 | 4 | 18 |
| 4558 | 0.16 | 0.00 | 0.32 | 0.0 | 0.10 | 0.1 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.0 | 0.025 | 0.017 | 0.008 | 0.000 | 0.008 | 0.008 | 1.318 | 12 | 244 |
| 3172 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 2.428 | 5 | 17 |
| 4257 | 1.47 | 1.47 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 1.47 | ... | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 2.391 | 21 | 55 |
| 132 | 0.00 | 0.00 | 1.12 | 0.0 | 0.56 | 0.0 | 0.00 | 0.00 | 0.00 | 0.56 | ... | 0.0 | 0.000 | 0.101 | 0.000 | 0.606 | 0.000 | 0.000 | 2.360 | 19 | 144 |
3450 rows × 57 columns
Checking for correlation between features after applying the scikit learn variance threshold
# we compute correlations on X_train, not on the full feature set
cor = X_train.corr()
# we use the seaborn heatmap to visualise the correlations.
cmap = sns.cm.rocket_r #for reversed color, the darker the more correlated.
plt.figure(figsize=(50, 50)) # set the size of the plot.
# initialise the seaborn heatmap.
ax = sns.heatmap(cor, linewidths=.3, annot=True, fmt=".2", cmap=cmap)#show numbers on the cells: annot=True
# rotate the x-axis labels so they remain readable
ax.tick_params(axis='x', labelrotation=45)
Using the Pearson correlation coefficient to remove features that are highly correlated with each other.
# code partially gotten from https://www.youtube.com/watch?v=FndwYNcVe0U&list=PLZoTAELRMXVPgjwJ8VyRoqmfNs2CJwhVH&index=3
# Checking for correlation using pearson correlation coefficient
def correlation(dataset, threshold):
    col_corr = set()  # set of the names of correlated columns to drop
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns) - 1):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:  # we are interested in absolute coefficients
                # compare each feature's correlation with the target and drop the one with the lower coefficient
                if abs(corr_matrix.iloc[j, len(corr_matrix.columns) - 1]) > abs(corr_matrix.iloc[i, len(corr_matrix.columns) - 1]):
                    colname = corr_matrix.columns[i]
                else:
                    colname = corr_matrix.columns[j]
                col_corr.add(colname)
    return col_corr
# Apply the pearson correlation to our dataset with a threshold of 90%.
corr_features = correlation(dataset, 0.90)
print(corr_features)
print('number of correlated features: '+str(len(set(corr_features))))
{'word_freq_415'}
number of correlated features: 1
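For reference, the Pearson coefficient that .corr() computes for each pair of columns is their covariance divided by the product of their standard deviations; a minimal pure-Python version of the formula:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# a perfectly linear pair has coefficient 1.0; word_freq_415 was dropped
# above because it is almost this strongly correlated with another feature
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # -> 1.0
```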
# drop the correlated features; of each correlated pair, the feature more related to the target is kept
dataset_feature_reduce = dataset.drop(corr_features, axis=1)
# get the shape of the reduced dataset
dataset_feature_reduce.shape
(4601, 57)
# create new train/test splits from the reduced dataset
X_train, X_test, y_train, y_test = train_test_split(dataset_feature_reduce.drop('label', axis=1),
                                                    dataset_feature_reduce['label'],
                                                    stratify=dataset_feature_reduce['label'],
                                                    random_state=42)
len(X_train.columns)
56
# visualise it with seaborn heatmap
#reverse the color scheme: the darker the more positive related.
cmap = sns.cm.rocket_r
plt.figure(figsize=(50, 50)) # set the size of the heatmap
#https://stackoverflow.com/questions/39409866/correlation-heatmap
view = sns.heatmap(dataset_feature_reduce.corr(), linewidths=.3, annot=True, fmt=".2", cmap=cmap)#show numbers on the cells: annot=True
# To avoid resetting labels
view.tick_params(axis='x', labelrotation=45) # tilt the x-label by 45 degree.
Let's compare the evaluation results before and after reducing features
#before feature reduction
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(n_estimators=100)  # todo: tune hyperparameters
scores_rf_full = cross_val_score(model_rf, dataset.drop('label', axis=1), dataset['label'], scoring="f1_weighted", cv=5)
print(scores_rf_full)
scores_rf_full.mean()
[0.94876861 0.94077062 0.95759847 0.97274908 0.82143156]
0.9282636688282416
#after feature reduction
model_rf = RandomForestClassifier(n_estimators=100)
scores_rf_featRed = cross_val_score(model_rf, dataset_feature_reduce.drop('label', axis=1), dataset_feature_reduce['label'], scoring = "f1_weighted", cv=5)
print(scores_rf_featRed)
scores_rf_featRed.mean()
[0.9476932 0.94188636 0.95867561 0.97166672 0.82360847]
0.9287060722020681
Applying Adaboost before and after feature reduction
#before feature reduction
# use the AdaBoost classifier with the default base estimator - DecisionTreeClassifier(max_depth=1)
from sklearn.ensemble import AdaBoostClassifier
model_ada = AdaBoostClassifier(n_estimators=100)  # todo: tune hyperparameters
scores_ada_full = cross_val_score(model_ada, dataset.drop('label', axis=1), dataset['label'], scoring="f1_weighted", cv=5)
print(scores_ada_full)
scores_ada_full.mean()  # best result: 0.9423
[0.94226055 0.94539341 0.9489726 0.95748857 0.82656792]
0.924136609628035
#after feature reduction
model_ada = AdaBoostClassifier(n_estimators=100)
scores_ada_feature_reduce = cross_val_score(model_ada, dataset_feature_reduce.drop('label', axis=1), dataset_feature_reduce['label'], scoring = "f1_weighted", cv=5)
print(scores_ada_feature_reduce)
scores_ada_feature_reduce.mean()
[0.94226055 0.94539341 0.9489726 0.95748857 0.82656792]
0.924136609628035
There's no significant difference before and after dropping the feature, because only one feature was dropped and it was insignificant to the target variable.
Sorting the features by the decision tree's feature_importances_ and removing those with zero importance for predicting the target variable.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42, max_depth=8,class_weight='balanced')
model.fit(X_train,y_train)
# Get feature importances
importances = model.feature_importances_
# view the feature importances on a barplot
feat_importances = pd.DataFrame(importances, index=X_train.columns, columns=["Importance"])
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
feat_importances.plot(kind='bar', figsize=(16,7))
<AxesSubplot: >
# keep the features whose importance is greater than the threshold in a new list
threshold = 0.00000
important_features = [feature for feature, importance in zip(X_train.columns, importances) if importance > threshold]
important_features
['word_freq_address',
'word_freq_all',
'word_freq_our',
'word_freq_over',
'word_freq_remove',
'word_freq_internet',
'word_freq_order',
'word_freq_mail',
'word_freq_receive',
'word_freq_addresses',
'word_freq_free',
'word_freq_business',
'word_freq_email',
'word_freq_font',
'word_freq_000',
'word_freq_money',
'word_freq_hp',
'word_freq_hpl',
'word_freq_george',
'word_freq_650',
'word_freq_labs',
'word_freq_telnet',
'word_freq_data',
'word_freq_technology',
'word_freq_1999',
'word_freq_direct',
'word_freq_meeting',
'word_freq_re',
'word_freq_edu',
'char_freq_;',
'char_freq_(',
'char_freq_!',
'char_freq_$',
'char_freq_#',
'capital_run_length_average',
'capital_run_length_longest',
'capital_run_length_total']
# get the length of the new important features
len(important_features)
37
# create a new dataset with only the important features
import_feat = dataset_feature_reduce[important_features]
import_feat.head()
| word_freq_address | word_freq_all | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | word_freq_receive | word_freq_addresses | ... | word_freq_re | word_freq_edu | char_freq_; | char_freq_( | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.64 | 0.64 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.000 | 0.778 | 0.000 | 0.000 | 3.756 | 61 | 278 |
| 1 | 0.28 | 0.50 | 0.14 | 0.28 | 0.21 | 0.07 | 0.00 | 0.94 | 0.21 | 0.14 | ... | 0.00 | 0.00 | 0.00 | 0.132 | 0.372 | 0.180 | 0.048 | 5.114 | 101 | 1028 |
| 2 | 0.00 | 0.71 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | 0.38 | 1.75 | ... | 0.06 | 0.06 | 0.01 | 0.143 | 0.276 | 0.184 | 0.010 | 9.821 | 485 | 2259 |
| 3 | 0.00 | 0.00 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | 0.31 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.137 | 0.137 | 0.000 | 0.000 | 3.537 | 40 | 191 |
| 4 | 0.00 | 0.00 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | 0.31 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.135 | 0.135 | 0.000 | 0.000 | 3.537 | 40 | 191 |
5 rows × 37 columns
cmap = sns.cm.rocket_r #reverse the color scheme: the darker the more positive related
plt.figure(figsize=(60, 50))
# Plot a heatmap of the important features
heat = sns.heatmap(import_feat.corr(), linewidths=.3, annot=True, fmt=".2", cmap=cmap)
heat.tick_params(axis='x', labelrotation=45)
In this example, we've applied scikit-learn's VarianceThreshold, the Pearson correlation coefficient, and the decision tree's feature_importances_ to extract the features that are relevant for predicting whether an email is spam.
In this section, we'll explore, compare, and optimise various classification models, ensemble models, and an ANN model.
Using Support Vector Machine on the reduced spambase dataset
from sklearn import svm
from sklearn.metrics import accuracy_score # to check the accuracy of the model
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(import_feat, y, stratify=y, test_size=0.25, random_state=42)
# Train the SVM model
clf = svm.SVC(kernel='linear')  # a linear kernel suits data that is (approximately) linearly separable
clf.fit(X_train, y_train)
# Make predictions on the test set
y_pred = clf.predict(X_test)
# Evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))
Accuracy: 92.96%
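Accuracy alone hides which kind of mistake the model makes, and for a spam filter a ham email wrongly flagged as spam (a false positive) is usually costlier than a missed spam. Precision and recall separate the two error types; a hand-rolled sketch on toy predictions (scikit-learn's precision_score and recall_score compute the same quantities):

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fp), tp / (tp + fn)

# toy labels: one spam missed (false negative), one ham flagged (false positive)
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0]
prec, rec = precision_recall(y_true_toy, y_pred_toy)
print(prec, rec)  # -> 0.75 0.75
```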
Using Decision tree classifier
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(max_depth=5, random_state=42)
dtc.fit(X_train, y_train)
# Make predictions on the test set
y_pred = dtc.predict(X_test)
# Evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))
Accuracy: 90.53%
Using SGDClassifier
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(loss='modified_huber')  # 'modified_huber' brings tolerance to outliers as well as probability estimates
sgd_clf.fit(X_train, y_train)
# Make predictions on the test set with the SGD model
y_pred = sgd_clf.predict(X_test)
# Evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))
Accuracy: 90.53%
Using Random Forest to classify the Spambase Dataset
from sklearn.ensemble import RandomForestClassifier
# Train the random forest model
clf = RandomForestClassifier(n_estimators=100,max_depth=5, random_state=42)
clf.fit(X_train, y_train)
# Make predictions on the test set
y_pred = clf.predict(X_test)
# Evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))
Accuracy: 92.27%
Using the ensemble learning technique boosting (AdaBoost classifier).
from sklearn.ensemble import AdaBoostClassifier
# Train the AdaBoost model
ada_clf = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=42)  # learning_rate is the weight applied to each classifier at each boosting iteration
ada_clf.fit(X_train, y_train)
# Evaluate the model's performance
accuracy = ada_clf.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
Accuracy: 92.27%
Using the ensemble learning bagging technique
from sklearn.ensemble import BaggingClassifier
# Define the base estimator
base_estimator = DecisionTreeClassifier(max_depth=10, random_state=42)
# Train the Bagging model
bag_clf = BaggingClassifier(estimator=base_estimator, n_estimators=100, random_state=42)
bag_clf.fit(X_train, y_train)
# Evaluate the model on the test set
accuracy = bag_clf.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
Accuracy: 92.96%
In this section, we use an Artificial Neural Network (ANN) to classify the Spambase dataset.
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler()  # define the scaler
# fit & transform the data (the resulting frame gets numeric column names).
# note: fitting the scaler on the full dataset before splitting leaks test-set
# statistics into training; fitting on X_train only would avoid this.
df_scaled = pd.DataFrame(sc.fit_transform(import_feat))
print(df_scaled.head())
0 1 2 3 4 5 6 \
0 0.044818 0.125490 0.032 0.000000 0.000000 0.000000 0.000000
1 0.019608 0.098039 0.014 0.047619 0.028886 0.006301 0.000000
2 0.000000 0.139216 0.123 0.032313 0.026135 0.010801 0.121673
3 0.000000 0.000000 0.063 0.000000 0.042641 0.056706 0.058935
4 0.000000 0.000000 0.063 0.000000 0.042641 0.056706 0.058935
7 8 9 ... 27 28 29 30 \
0 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000
1 0.051705 0.080460 0.031746 ... 0.000000 0.000000 0.000000 0.013536
2 0.013751 0.145594 0.396825 ... 0.002801 0.002721 0.002281 0.014664
3 0.034653 0.118774 0.000000 ... 0.000000 0.000000 0.000000 0.014048
4 0.034653 0.118774 0.000000 ... 0.000000 0.000000 0.000000 0.013843
31 32 33 34 35 36
0 0.023955 0.000000 0.000000 0.002502 0.006007 0.017487
1 0.011454 0.029985 0.002421 0.003735 0.010012 0.064836
2 0.008498 0.030651 0.000504 0.008008 0.048458 0.142551
3 0.004218 0.000000 0.000000 0.002303 0.003905 0.011995
4 0.004157 0.000000 0.000000 0.002303 0.003905 0.011995
[5 rows x 37 columns]
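MinMaxScaler rescales each column independently to [0, 1] via (x - min) / (max - min), which is why every value printed above falls in that range. The per-column formula written out (assuming the column is not constant, which the earlier zero-variance check guarantees):

```python
def min_max_scale(column):
    """Rescale a sequence to [0, 1], as MinMaxScaler does for each column."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

print(min_max_scale([1.0, 2.0, 3.0]))  # -> [0.0, 0.5, 1.0]
```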
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df_scaled, y, stratify=y, test_size=0.25, random_state=42)
# initialize the neural network; solver specifies the algorithm for weight optimisation over the nodes
neu_net = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, alpha=1e-4,
                        solver='sgd', verbose=10, tol=1e-4, random_state=1,
                        learning_rate_init=.1)
# train the neural network
neu_net.fit(X_train, y_train)
# evaluate the model
y_pred = neu_net.predict(X_test)
print(classification_report(y_test, y_pred))
Iteration 1, loss = 0.69343255
Iteration 2, loss = 0.63775589
Iteration 3, loss = 0.59498817
Iteration 4, loss = 0.52785481
Iteration 5, loss = 0.45840216
Iteration 6, loss = 0.40233166
Iteration 7, loss = 0.36669967
Iteration 8, loss = 0.34060970
Iteration 9, loss = 0.32209685
Iteration 10, loss = 0.30958613
Iteration 11, loss = 0.30187719
Iteration 12, loss = 0.29017174
Iteration 13, loss = 0.28105919
Iteration 14, loss = 0.27950280
Iteration 15, loss = 0.26965102
Iteration 16, loss = 0.26550145
Iteration 17, loss = 0.26146178
Iteration 18, loss = 0.25868865
Iteration 19, loss = 0.25590298
Iteration 20, loss = 0.25229036
Iteration 21, loss = 0.24932949
Iteration 22, loss = 0.24553931
Iteration 23, loss = 0.24198683
Iteration 24, loss = 0.24024006
Iteration 25, loss = 0.23902999
Iteration 26, loss = 0.23652663
Iteration 27, loss = 0.23735150
Iteration 28, loss = 0.23383407
Iteration 29, loss = 0.23335405
Iteration 30, loss = 0.22848290
Iteration 31, loss = 0.22945971
Iteration 32, loss = 0.22670369
Iteration 33, loss = 0.22988831
Iteration 34, loss = 0.22625644
Iteration 35, loss = 0.22593809
Iteration 36, loss = 0.22222303
Iteration 37, loss = 0.22159780
Iteration 38, loss = 0.22322377
Iteration 39, loss = 0.22084836
Iteration 40, loss = 0.22004358
Iteration 41, loss = 0.21623610
Iteration 42, loss = 0.21601193
Iteration 43, loss = 0.21839138
Iteration 44, loss = 0.21665415
Iteration 45, loss = 0.21471715
Iteration 46, loss = 0.21303147
Iteration 47, loss = 0.21183811
Iteration 48, loss = 0.21213618
Iteration 49, loss = 0.21060671
Iteration 50, loss = 0.21010015
Iteration 51, loss = 0.21087013
Iteration 52, loss = 0.21115919
Iteration 53, loss = 0.20976592
Iteration 54, loss = 0.20881524
Iteration 55, loss = 0.20815008
Iteration 56, loss = 0.20475861
Iteration 57, loss = 0.20538371
Iteration 58, loss = 0.20456340
Iteration 59, loss = 0.20697367
Iteration 60, loss = 0.20763801
Iteration 61, loss = 0.20431188
Iteration 62, loss = 0.20307711
Iteration 63, loss = 0.20353348
Iteration 64, loss = 0.20186221
Iteration 65, loss = 0.20263261
Iteration 66, loss = 0.20268452
Iteration 67, loss = 0.20410869
Iteration 68, loss = 0.19851281
Iteration 69, loss = 0.20041458
Iteration 70, loss = 0.19893609
Iteration 71, loss = 0.19856752
Iteration 72, loss = 0.19520744
Iteration 73, loss = 0.19711629
Iteration 74, loss = 0.20224546
Iteration 75, loss = 0.19738338
Iteration 76, loss = 0.19452250
Iteration 77, loss = 0.19618929
Iteration 78, loss = 0.19542535
Iteration 79, loss = 0.19477962
Iteration 80, loss = 0.19301399
Iteration 81, loss = 0.19336297
Iteration 82, loss = 0.19241697
Iteration 83, loss = 0.19238705
Iteration 84, loss = 0.19896561
Iteration 85, loss = 0.19384432
Iteration 86, loss = 0.19112966
Iteration 87, loss = 0.19233425
Iteration 88, loss = 0.19251401
Iteration 89, loss = 0.19105247
Iteration 90, loss = 0.19009781
Iteration 91, loss = 0.18778175
Iteration 92, loss = 0.18922685
Iteration 93, loss = 0.18841970
Iteration 94, loss = 0.19046763
Iteration 95, loss = 0.18784575
Iteration 96, loss = 0.18821714
Iteration 97, loss = 0.18545463
Iteration 98, loss = 0.18877240
Iteration 99, loss = 0.18624763
Iteration 100, loss = 0.18780378
Iteration 101, loss = 0.18318539
Iteration 102, loss = 0.18432698
Iteration 103, loss = 0.18430055
Iteration 104, loss = 0.18373412
Iteration 105, loss = 0.18163919
Iteration 106, loss = 0.18406481
Iteration 107, loss = 0.18180152
Iteration 108, loss = 0.18445303
Iteration 109, loss = 0.18268441
Iteration 110, loss = 0.18664655
Iteration 111, loss = 0.18374929
Iteration 112, loss = 0.18563278
Iteration 113, loss = 0.18450609
Iteration 114, loss = 0.18402619
Iteration 115, loss = 0.18236231
Iteration 116, loss = 0.18302651
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
              precision    recall  f1-score   support

           0       0.95      0.92      0.94       697
           1       0.88      0.93      0.91       454

    accuracy                           0.92      1151
   macro avg       0.92      0.93      0.92      1151
weighted avg       0.93      0.92      0.92      1151
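The numbers in a classification report come directly from the confusion matrix. The sketch below hand-computes precision, recall, and F1 for the spam class (label 1) on a small set of toy labels, not the model's actual predictions, and checks the results against scikit-learn:

```python
# Hand-computing precision/recall/F1 for the positive (spam) class on toy
# labels, to make the classification_report numbers concrete.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)   # of emails predicted as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam emails, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
print(precision, recall, f1)
```

For spam filtering, precision on the spam class is especially important: a false positive means a legitimate email is discarded, which is usually costlier than letting one spam message through.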
We have now trained and tested several individual classifiers and ensemble techniques on the dataset, and each performed well, some better than others. Next, we evaluate them side by side to determine which performed best.
Evaluate the models.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Define the models to evaluate
models = {"Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
"Decision Tree":DecisionTreeClassifier(max_depth=5, random_state=42),
"AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
"SVM":svm.SVC(kernel='linear'),
"SGD":SGDClassifier(loss = 'modified_huber'),
"Bagging": BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=5, random_state=42), n_estimators=100, random_state=42),
"Neural Network": MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, alpha=1e-4,
solver='sgd', verbose=10, tol=1e-4, random_state=1, learning_rate_init=.1)
}
# Evaluate each model
for name, model in models.items():
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred) * 100
precision = precision_score(y_test, y_pred) * 100
recall = recall_score(y_test, y_pred) * 100
f1 = f1_score(y_test, y_pred) * 100
print(f"{name}:\n\tAccuracy: {accuracy:.2f}\n\tPrecision: {precision:.2f}\n\tRecall: {recall:.2f}\n\tF1-Score: {f1:.2f}")
Random Forest:
	Accuracy: 94.53
	Precision: 94.74
	Recall: 91.19
	F1-Score: 92.93
Decision Tree:
	Accuracy: 90.53
	Precision: 89.47
	Recall: 86.12
	F1-Score: 87.77
AdaBoost:
	Accuracy: 94.61
	Precision: 93.36
	Recall: 92.95
	F1-Score: 93.16
SVM:
	Accuracy: 89.05
	Precision: 93.39
	Recall: 77.75
	F1-Score: 84.86
SGD:
	Accuracy: 91.14
	Precision: 86.67
	Recall: 91.63
	F1-Score: 89.08
Bagging:
	Accuracy: 91.14
	Precision: 93.35
	Recall: 83.48
	F1-Score: 88.14
Neural Network:
	Accuracy: 92.44
	Precision: 88.31
	Recall: 93.17
	F1-Score: 90.68
(The MLP training iterations printed during this run are omitted; they are identical to the log shown above.)
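For a quicker comparison, the scores reported above can be collected into a pandas DataFrame and ranked. The sketch below simply tabulates those printed results by hand; it does not re-run any model.

```python
# Tabulating the evaluation scores printed above so the models can be
# ranked at a glance. Values are copied from the printed results.
import pandas as pd

results = pd.DataFrame({
    "Accuracy":  [94.53, 90.53, 94.61, 89.05, 91.14, 91.14, 92.44],
    "Precision": [94.74, 89.47, 93.36, 93.39, 86.67, 93.35, 88.31],
    "Recall":    [91.19, 86.12, 92.95, 77.75, 91.63, 83.48, 93.17],
    "F1-Score":  [92.93, 87.77, 93.16, 84.86, 89.08, 88.14, 90.68],
}, index=["Random Forest", "Decision Tree", "AdaBoost", "SVM",
          "SGD", "Bagging", "Neural Network"])

# Sort by accuracy, best model first
print(results.sort_values("Accuracy", ascending=False))
```

Sorting by accuracy puts AdaBoost and Random Forest at the top of the table, which matches the conclusion drawn below.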
After applying data preprocessing, feature engineering and selection, and training and testing on the Spambase dataset, the evaluation shows that the models with the highest accuracy are AdaBoost and Random Forest, at 94.61% and 94.53% respectively. Other models also performed well; the ANN, for example, scored 92.44%. Decreasing the ANN's learning rate to 0.01 improved this slightly, to 92.96%, but at the cost of a longer training time. SVM had the lowest accuracy, at 89.05%. We therefore recommend AdaBoost for building a model to classify emails as spam or ham, as its accuracy is higher than that of the other classifiers.